Abstract

Today's deep learning systems deliver high performance
based on end-to-end training but are notoriously hard to
inspect. We argue that at least two reasons make
inspectability challenging: (i) representations are
distributed across hundreds of channels and (ii) a unifying
metric quantifying inspectability is lacking. In this paper,
we address both issues by proposing supervised and
unsupervised Semantic Bottleneck (SB) layers, which we
integrate into pretrained networks to align channel outputs
with individual visual concepts, and by introducing the
model-agnostic AUiC metric to measure this alignment. We
present a case study on semantic segmentation to demonstrate
that SBs improve the AUiC up to four-fold over regular
network outputs while recovering state-of-the-art
performance.

1. Introduction 

While training with a single output loss (end-to-end) is key
to the top performance of deep learning, it is also the main
obstacle to obtaining inspectable systems. A key problem is
that all intermediate representations are learned without
interpretability as an explicit objective, leaving them
opaque to humans. Furthermore, inspectability has remained
fairly elusive to assess, since its framing has mostly been
qualitative (e.g. saliency maps). Given the increasing
interest in using deep learning in real-world applications,
interpretability and a way to quantify it are critically
missing.

Desiderata for inspectability. To address this, we demand
that the information in each channel be represented by a
single semantic (sub-)concept, similar to how the task of
classification enforces semantic meaning on the output
predictions. This is derived from a simple observation:
distributed representations do not lend themselves to
trivial interpretation. Hence, we aim to adapt deep networks
to reduce distributed representations by (i) reducing the
number of channels to a minimum, (ii) associating them with
semantic (sub-)concepts and, at the same time, (iii) losing
as little overall performance as possible. In our view, such
semantics-based inspectability can be seen as a step towards
achieving true interpretability of deep networks.

Our contributions are three-fold. Firstly, we introduce two
network layers based on linear layers, which we term
Semantic Bottlenecks (SBs), to improve alignment with
semantic concepts by (i) supervision to visual concepts and
(ii) regularizing the output to be one-hot encoded.
Secondly, we show that integrating SBs into a
state-of-the-art architecture does not impair performance,
even for low-dimensional SBs that reduce the number of
channels from 4096 to 30. Finally, we introduce the novel
AUiC metric to quantify alignment between channel outputs
and visual concepts for any model and show that our SBs
improve the baselines up to four-fold. To our knowledge, we
are the first to show such a modular approach that
substantially improves inherent model inspectability without
losing the performance of state-of-the-art classification
models. Combined with our AUiC, our approach is general and
easy to apply to any model.


2. Related Work 

As argued in prior work [5], interpretability can largely be
approached in two ways. The first is post-hoc
interpretation, for which we take an already trained and
well-performing model and dissect its internals a posteriori
to identify important input features via attribution or
grouping [1, 7, 8, 10, 11, 14]. The second approach is to
design inherently interpretable models, either with [3, 4]
or without supervision [6]. In contrast to our work, these
models are generally not designed to be modular for
application in modern classification architectures and are
challenging to integrate or fail to reach good performance.
In order to investigate the inspectability of deep networks,
Bau et al. proposed NetDissect, a method counting the number
of channels associable with single visual concepts [1]. Our
AUiC metric leverages the ideas of NetDissect and extends
them to satisfy three criteria we deem important for
measuring inspectability, which NetDissect does not
satisfy.

Broden+ object        Sky  Building  Person  Road  Car  Lamp  Bike  Van  Truck  Motorbike  Train  Bus
# subordinate parts   1    5         14      1     9    3     4     6    2      3          5      6
Broden+ material      Brick, Fabric, Foliage, Glass, Metal, Plastic, Rubber, Skin, Stone, Tile, Wood

Table 1: Relevant concepts from Broden+ for the Cityscapes domain.
Parts are grouped by their respective parent object (top two rows);
material concepts are listed in the bottom row.


3. Semantic Bottlenecks

To obtain more inspectable intermediate representations, we
demand that the information in each channel be represented
by a single semantic concept. We propose two variants to
achieve this goal: (i) supervise single channels to
represent a unique concept and (ii) enforce one-hot outputs
to encourage concept-aligned channels and inhibit
distributed representations. We construct both variants as
layers that can be integrated into pretrained models,
mapping intermediate representations to a semantic space. We
name these supervised and unsupervised Semantic Bottlenecks
(SBs).

Case Study. To show the utility of SBs, we choose street
scene segmentation on the Cityscapes dataset [2]. We use
PSPNet [12] based on ResNet-101.

3.1. Supervised Semantic Bottlenecks (SSBs)

Variant (i) supervises each SB channel to represent a single
semantic concept using additional concept annotations. One
(or multiple) linear layers receive the distributed inputs
from a host model and are supervised using an auxiliary loss
to map them to target concepts. These predictions are
concatenated and fed into the next layer of the host model.

Choosing concepts for Cityscapes. For our supervised SB
layer, we choose concepts based on task relevance for
Cityscapes. Broden+ [9] is a recent collection of
pixel-level annotations including parts and materials. We
select 70 concepts we deem task relevant (see table 1).

Implementation details. Since the Broden concepts are not
defined on the Cityscapes images, we train SSBs on a
pretrained host whose parameters are kept fixed. We insert
the SSB at two different locations: block4 and pyramid. We
choose single 1×1-conv layers as classifiers for this task,
which are integrated into the host network after training by
adjusting the input dimensionality of the next layer. After
integration, all downstream layers are finetuned.
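
The bottleneck mechanics described above can be sketched in a few lines. This is a minimal illustration only, assuming NumPy arrays in (channels, height, width) layout; the helper `conv1x1`, the random weights and the three concept-group sizes are hypothetical and not taken from the paper, and no training or auxiliary loss is shown.

```python
import numpy as np

def conv1x1(features, weight, bias):
    """1x1 convolution: the same linear map over channels at every pixel.

    features: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]

rng = np.random.default_rng(0)
host_features = rng.normal(size=(4096, 8, 8))  # stand-in for a host-layer output

# One linear classifier per concept group (group sizes are illustrative).
group_sizes = [14, 9, 11]
heads = [(rng.normal(size=(n, 4096)) * 0.01, np.zeros(n)) for n in group_sizes]

# Concatenate all concept predictions along the channel axis; this tensor
# replaces the host features as input to the next host layer.
ssb_output = np.concatenate(
    [conv1x1(host_features, w, b) for w, b in heads], axis=0
)
print(ssb_output.shape)  # (34, 8, 8)
```

Because the bottleneck output has far fewer channels than the host features, the next host layer needs its input dimensionality adjusted, which is the integration step the text refers to.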

3.2. Unsupervised Semantic Bottlenecks (USBs)

Clearly, the requirement for additional annotations and the
uncertainty regarding the choice of concepts are limitations
of SSBs. To address them, we investigate an annotation-free
method to (i) reduce the number of channels, (ii) increase
semantic association and (iii) lose as little performance as
possible. To approach point (ii), we propose unsupervised
semantic bottlenecks (USBs) that enforce non-distributed
representations by approaching one-hot encodings. In the
following, we investigate the use of softmax activations as
a means to address this point.


3.2.1 Construction of USBs 

We keep the same bottleneck framework as for SSBs, but add a
softmax activation function on the outputs of the 1×1-conv
layers that we regularize accordingly to achieve one-hot
encodings. Parameterizing softmax with a temperature $T$, it
approaches arg max as $T \to 0$. In our setup, we start with
a high $T$, e.g. $T_0 = 1$, and reduce it to $T_\tau = 0.01$
in $\tau$ training iterations to approach arg max.

Implementation details. We start with a pretrained PSPNet,
integrate the USB and finetune all downstream layers plus
the USB itself. During inference, we compute arg max instead
of softmax to acquire one-hot encodings.
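
The effect of the temperature can be illustrated with a small sketch. The text only states that the temperature is reduced from 1 to 0.01 over the training iterations; the geometric decay used in `temperature_at` is an assumption made here for illustration, as are all function names.

```python
import numpy as np

def softmax_t(logits, temperature):
    """Softmax over the channel axis (axis 0) with temperature T."""
    z = logits / temperature
    z = z - z.max(axis=0, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def temperature_at(step, total_steps, t_start=1.0, t_end=0.01):
    """Geometric decay from t_start to t_end (schedule shape is an assumption)."""
    frac = min(step, total_steps) / total_steps
    return t_start * (t_end / t_start) ** frac

def one_hot_argmax(logits):
    """Inference-time replacement for softmax: an exact one-hot encoding."""
    idx = logits.argmax(axis=0)
    return (np.arange(logits.shape[0])[:, None] == idx[None, :]).astype(float)

logits = np.array([[2.0, 0.1], [1.0, 0.3], [0.5, 0.2]])  # 3 channels, 2 pixels
for step in (0, 500, 1000):
    T = temperature_at(step, 1000)
    print(T, softmax_t(logits, T).max(axis=0))
```

As the temperature shrinks, the largest softmax entry per pixel moves towards 1, so the training-time output approaches the one-hot encoding that arg max produces at inference.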


4. Quantification of Layer Inspectability 

We present the AUiC metric, which measures the alignment
between channels and visual concepts and enables
architecture-agnostic benchmarking. We specify three
criteria that AUiC has to satisfy: (i) It must be a scalar
measuring how well a set of visual concepts can be aligned
to channel outputs. (ii) It must be model agnostic to allow
comparisons between different activation functions. (iii) It
must be computed without bias w.r.t. the concept area in the
image. The fundamental ideas inspiring our metric are based
on the frequently cited NetDissect method [1].

4.1. AUiC metric 


Our proposed metric involves two steps. 

Channel-concept matching. As a first step, each channel
needs to be identified as a detector for a single concept.
Given a dataset $X$ containing annotations for the concept
set $C$, we compare channel activations $A_k$ and pixel
annotations $L_c$, where $c \in C$. Since a channel output
$A_k$ is continuous, it needs to be binarized with a
threshold $\theta_k$, yielding the binary mask
$M_k = M(k, \theta_k) = A_k > \theta_k$. The comparison can
subsequently be quantified with a metric like

\[ \mathrm{IoU}(x) = \frac{|M_k \cap L_c|}{|M_k \cup L_c|}, \]

with $|\cdot|$ being the cardinality of a set. A few things
need to be considered w.r.t. our criteria. IoU penalizes
small areas more than large ones, since small annotations
are disproportionately more susceptible to noise, which
would become an issue later when optimizing $\theta$. We
address this issue by using the mean IoU of positive and
negative responses to balance the label area by its
complement, with $\overline{M_k}$ and $\overline{L_c}$ being
the complements of the binary mask and the annotation. The
alignment score between channel and concept is subsequently
defined over the whole dataset $X$:

\[ \mathrm{mIoU}_{k,c}(X) = \frac{1}{2} \left[ \frac{\sum_{x \in X} |M_k \cap L_c|}{\sum_{x \in X} |M_k \cup L_c|} + \frac{\sum_{x \in X} |\overline{M_k} \cap \overline{L_c}|}{\sum_{x \in X} |\overline{M_k} \cup \overline{L_c}|} \right] \tag{1} \]



We sum over all samples before computing the fraction to 
include samples not containing concept c. 
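
The alignment score described above (mean of the IoU on positive responses and the IoU on their complements, with sums taken over the whole dataset before dividing) can be sketched as follows. The function name `miou` and the toy masks are illustrative assumptions, not the authors' code.

```python
import numpy as np

def miou(masks, labels):
    """Mean of foreground and background IoU for one channel and one concept.

    masks, labels: boolean arrays of shape (N, H, W). Intersections and
    unions are summed over all N samples before dividing, so images that
    do not contain the concept still contribute.
    """
    masks, labels = np.asarray(masks, bool), np.asarray(labels, bool)
    pos = (masks & labels).sum() / max((masks | labels).sum(), 1)
    neg = (~masks & ~labels).sum() / max((~masks | ~labels).sum(), 1)
    return 0.5 * (pos + neg)

# Toy example: a small concept (4 pixels) detected with 4 false positives.
label = np.zeros((1, 10, 10), bool)
label[0, :2, :2] = True
mask = label.copy()
mask[0, 5:7, 5:7] = True

plain_iou = (mask & label).sum() / (mask | label).sum()
print(plain_iou, miou(mask, label))
```

In this toy case the plain IoU is 4/8, while the background term is 92/96, so the few false positives weigh far less on the combined score than on the plain IoU of a small concept.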

Secondly, the alignment between channel and concept is
sensitive to $\theta_k$. We keep the determination of
$\theta_k$ agnostic to the activation distribution by
finding the critical point $\theta^*_{k,c}$, now per channel
and concept, that maximizes
$\mathrm{mIoU}_{k,c}(X, \theta_{k,c})$, now parameterized
with the threshold:

\[ \theta^*_{k,c} = \arg\max_{\theta_{k,c}} \mathrm{mIoU}_{k,c}(X, \theta_{k,c}) \tag{2} \]

channel  trained concept  IoU assignment                Our mIoU assignment
16       person/hair      torso (0.07), painted (0.06)  person (0.57), hair (0.55)
32       lamp/shade       painted (0.07), brown (0.06)  shade (0.58), lamp (0.53)
18       person/foot      torso (0.07), black (0.05)    person (0.53), foot (0.53)
69       wood material    brown (0.07), painted (0.07)  wood (0.53), floor (0.52)

Table 2: Channel identification comparison for SSB@pyramid using
either IoU or mIoU. The latter reduces size bias substantially.

This leaves $|C|$ concepts per channel, for each of which we
have identified the best threshold. The final assignment is
performed in a last step, choosing the concept $c^*$ that
maximizes mIoU:

\[ c^* = \arg\max_{c} \mathrm{mIoU}_{k,c}(X, \theta^*_{k,c}) \tag{3} \]

Each concept can be assigned to multiple channels, but not
vice versa.
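
The two maximizations in this step (best threshold per channel-concept pair, then best concept per channel) might be implemented roughly as below. The text does not say how the maximum over thresholds is found; the grid search over `candidates` is an assumption, and the synthetic "car" and "road" labels are purely illustrative.

```python
import numpy as np

def miou(mask, label):
    """Mean of foreground and background IoU, summed over all samples."""
    pos = (mask & label).sum() / max((mask | label).sum(), 1)
    neg = (~mask & ~label).sum() / max((~mask | ~label).sum(), 1)
    return 0.5 * (pos + neg)

def best_threshold(activations, label, candidates):
    """Pick the threshold maximizing mIoU for one channel-concept pair."""
    scores = [miou(activations > t, label) for t in candidates]
    i = int(np.argmax(scores))
    return candidates[i], scores[i]

def assign_concept(activations, labels_by_concept, candidates):
    """Choose the concept whose best-threshold mIoU is highest."""
    results = {c: best_threshold(activations, lab, candidates)
               for c, lab in labels_by_concept.items()}
    c_star = max(results, key=lambda c: results[c][1])
    return c_star, results[c_star]

# Synthetic data: one channel that responds to "car" plus Gaussian noise.
rng = np.random.default_rng(0)
labels = {"car": rng.random((4, 16, 16)) < 0.2,
          "road": rng.random((4, 16, 16)) < 0.5}
acts = labels["car"] * 1.0 + 0.1 * rng.normal(size=(4, 16, 16))
candidates = np.linspace(acts.min(), acts.max(), 50)

concept, (theta, score) = assign_concept(acts, labels, candidates)
print(concept, theta, score)
```

Because the threshold is chosen by maximizing the score itself, the same procedure applies unchanged to channels with very different activation ranges.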

Scalar quantity. The second step summarizes the
identifiability in a scalar value, with 0 indicating that no
channel can be identified and 1 that all can. Given a global
mIoU threshold $\xi$, we can determine the fraction of
channels having a greater mIoU. In order to keep the metric
agnostic to the choice of $\xi$, we define the final AUiC as
the AUC under the indicator function, which counts
identifiable channels, for all $\xi \in [0.5, 1]$:

\[ \mathrm{AUiC} = 2 \int_{0.5}^{1} \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}_{\mathrm{mIoU}_{k,c^*} \geq \xi} \, d\xi \tag{4} \]

where $K$ denotes the number of channels.
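
As a rough sketch of this definition, the area under the fraction of identifiable channels can be computed numerically and compared against a closed form. The closed form follows because, for a single channel with score m, the indicator integrates to min(max(m - 0.5, 0), 0.5) over thresholds in [0.5, 1]. Function names are hypothetical, and the four input scores are the top mIoU values from Table 2, used only as sample numbers rather than as a reported AUiC.

```python
import numpy as np

def auic_numeric(channel_mious, num_steps=10001):
    """Twice the area under the fraction of channels with mIoU >= threshold."""
    m = np.asarray(channel_mious, dtype=float)
    xis = np.linspace(0.5, 1.0, num_steps)
    frac = (m[None, :] >= xis[:, None]).mean(axis=1)  # identifiable fraction
    dx = xis[1] - xis[0]
    area = dx * (frac[:-1] + frac[1:]).sum() / 2.0    # trapezoidal rule
    return 2.0 * area

def auic_closed_form(channel_mious):
    """Same quantity via the per-channel integral of the indicator."""
    m = np.clip(np.asarray(channel_mious, dtype=float), 0.5, 1.0)
    return 2.0 * (m - 0.5).mean()

scores = [0.57, 0.58, 0.53, 0.53]
print(auic_numeric(scores), auic_closed_form(scores))
```

The factor 2 compensates for the integration interval having length 0.5, so a layer whose channels all reach an mIoU of 1 scores exactly 1.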

4.2. Discussion 

We conclude by showing that AUiC satisfies our three
criteria and by contrasting it with the related NetDissect
measure.

Clear definition in [0,1]. 0 must indicate no channel
alignment and 1 perfect alignment for all channels. AUiC
satisfies this criterion, as it integrates over all mIoU
thresholds. NetDissect instead chooses a specific IoU
threshold $\xi = 0.04$, giving a false sense of security,
since all channels only need to pass this threshold.

Agnostic to model. To enable comparison across diverse
types of models, we require a metric agnostic to the
distribution of outputs. AUiC satisfies this criterion,
since it considers the threshold $\theta^*$ that maximizes
mIoU. In NetDissect's measure, in contrast, the activation
binarization threshold $\theta_k$ is chosen based on the top
quantile level of activations $a_k \in A_k$ such that
$P(a_k > \theta_k) = 0.005$. This fails for non-Gaussian
distributions, e.g. Bernoulli, for which $\theta_k$ could
wrongly be set to 1, resulting in $M_k$ being always 0.
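
The failure case for binary outputs is easy to reproduce. The snippet below is a small illustration with synthetic Bernoulli activations; the activation rate of 0.3 and the sample size are arbitrary choices made here, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = (rng.random(100_000) < 0.3).astype(float)  # binary channel output

# Quantile-based threshold: the value exceeded by the top 0.5% of activations.
theta_quantile = np.quantile(acts, 1.0 - 0.005)
mask_quantile = acts > theta_quantile

# Any threshold in [0, 1) separates the two activation values instead.
mask_mid = acts > 0.5

print(theta_quantile)       # 1.0
print(mask_quantile.any())  # False
print(mask_mid.mean())
```

Since far more than 0.5% of the activations equal 1, the quantile lands on 1 and the strict comparison leaves the mask empty, whereas a threshold selected by maximizing mIoU would fall below 1 and keep the active pixels.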

Insensitivity to size of concept. To show the size bias of
IoU, we compare concept assignments based on IoU and mIoU.
We perform this comparison on SSB@pyramid, since its
channels are pre-assigned. Table 2 presents the assignments
of each method (columns) for four channels (rows). The mIoU
assignments are consistent with the trained concepts, even
identifying the concept wood. Using IoU instead, concepts
like painted or black are among those identified. These
concepts cover large areas in Broden images, making them
less susceptible to noise. The average number of pixels per
image for painted, for example, is 1087.5, resulting in an
IoU of 0.06, while hair covers only 93.8 pixels on average
and does not show up when using IoU. mIoU, on the other
hand, computes a score of 0.55 for hair on channel 16, which
is trained for hair. NetDissect's metric uses IoU, for which
the authors manually adjusted the threshold to make it
unbiased [13]. Since this adjustment is done for normal
distributions, it is not guaranteed to be unbiased for
others.

Figure 1: AUiC inspectability scores for SSBs (yellow), USBs
(blue) and baselines (red). Higher values are better, 1
being perfect channel-concept alignment. SBs substantially
improve that alignment and thus inspectability. E indicates
the number of channels.

5. Results 

Here we show that introducing Semantic Bottlenecks (SBs)
achieves all three of our goals (i)-(iii). To assess the
semantic alignment of channels (goal (ii)), we use our AUiC
metric to show improved inspectability for SBs over baseline
layers. Additionally, we plot the mIoU performance, showing
that we can recover performance and thereby achieve goals
(i) and (iii).

5.1. Setup 

Datasets. We compare alignments with three different
datasets to cover a wide range of concepts from different
domains. The broadest dataset we evaluate is Broden [1],
which covers 729 concepts from categories like object, part,
material, texture and color (skipping scene concepts). Since
the Broden images are mostly out of domain w.r.t.
Cityscapes, we additionally evaluate Cityscapes-Classes and
Cityscapes-Parts, a dataset we introduce to include concepts
subordinate to the 19 classes. The new dataset includes 11
coarsely annotated images covering 38 different concepts.

Compared models. Given 70 channels for SSBs, we choose 25
and 36 channels for USBs.

5.2. Quantitative improvements 

We compare vanilla PSPNet with SSBs and USBs and do 
so for outputs of block4 and the pyramid layer. The plots 
for these two layers are presented in columns in figure 1. 
Each row shows results for a different dataset in this order: 
Cityscapes-Classes, Cityscapes-Parts and Broden. PSPNet 
layer outputs are indicated by color red, SSBs by yellow and 
USBs by blue. 

SSBs enable inspection for subordinate concepts. On each
layer and dataset except Cityscapes-Classes, SSBs outperform
the baselines. Most encouragingly, SSBs improve the AUiC on
Cityscapes-Parts from under 0.1 to over 0.3 for both block4
and pyramid, a big leap towards inspectable representations.

Figure 2: Top-20 Broden-aligned channels from SSB, USB and
vanilla PSPNet outputs. Each color is mapped to a single
output channel.

USBs delineate Cityscapes-Classes related concepts. In
comparison to SSBs, USBs align very well with
Cityscapes-Classes. The increase from 0.05 to over 0.4 AUiC
on block4 is especially remarkable.

5.3. Qualitative improvements 

To support our quantitative results, we provide
visualizations of SB layers in comparison to baselines. We
show that SB outputs offer substantially improved spatial
coherency and consistency. To enable comparison between
1000s and 10s of channels, we use the mIoU scoring of our
metric to rank channels. We show the top-20 channels,
assigning each a unique color and plotting the arg max per
location. Based on our discussion of inspectable channels,
this will result in coherent activations for unique concepts
if a channel is aligned. Visualizations are presented in
figure 2 for all tested layer locations.
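
The visualization procedure (rank channels by mIoU, keep the top 20, color each one, plot the arg max per location) could look roughly like this. It is a sketch on random data; `top_k_argmax_map` and `distinct_palette` are hypothetical helpers, and the palette construction is just one simple way to obtain distinct colors.

```python
import numpy as np

def top_k_argmax_map(activations, channel_scores, k=20):
    """Keep the k best-scoring channels and take the arg max per pixel.

    activations: (C, H, W) layer output; channel_scores: (C,) mIoU per channel.
    Returns an (H, W) map of indices into the top-k list, and the top-k channels.
    """
    top = np.argsort(channel_scores)[::-1][:k]
    index_map = activations[top].argmax(axis=0)
    return index_map, top

def distinct_palette(k):
    """k distinct RGB colors for k <= 256 (odd multipliers are invertible mod 256)."""
    idx = np.arange(k)
    rgb = np.stack([idx * 37 % 256, idx * 101 % 256, idx * 173 % 256], axis=1)
    return rgb.astype(np.uint8)

rng = np.random.default_rng(0)
activations = rng.random((64, 8, 8))  # stand-in for a layer output
channel_scores = rng.random(64)       # stand-in for per-channel mIoU

index_map, top = top_k_argmax_map(activations, channel_scores, k=20)
rgb = distinct_palette(20)[index_map]  # (H, W, 3) color image
print(index_map.shape, rgb.shape)
```

With aligned channels, neighboring pixels of the same concept select the same channel, which is what produces the spatially coherent color regions discussed in the text.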

PSPNet outputs in the first row (Vanilla) show that they 
are very difficult to inspect even if ranked by best aligned 
channels. This leads us to believe that representations are 
highly distributed across channels. 

SSB outputs. Attending to the first image on the left half
of figure 2, we see that spatial coherency is greatly
improved for SSB and USB outputs over the baseline. In
particular, note the responses for SSB@block4, which show a
distinction into wheels (blue), car windows (dark orange)
and person-legs (light gray). A similar distinction for
SSB@block4 can be seen for the second input image, where one
channel is activated for the upper body (mint green) and one
for the legs (light gray). We find that our SB layers offer
dramatically increased inspectability, giving insights into
strong correlations between channel outputs and distinct
input features.

USB outputs. In relation to the SSB outputs, the USBs
appear to form representations that are aligned early with
the output classes, which is especially evident for
USB@pyramid, where the visual distinction between classes is
already very sharp.

6. Conclusion 

We proposed supervised and unsupervised Semantic Bottlenecks
(SBs) to align deep representations with semantic concepts.
Additionally, we introduced the AUiC metric, which
quantifies such alignment to enable model-agnostic
benchmarking, and showed that SBs improve baseline scores up
to four-fold while retaining performance.